Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae)

Chironomids are among the most abundant aquatic insects and are widely distributed in various biological communities. However, the lack of high-quality genomes has hindered our ability to study the evolution and ecology of this group. Here, we used Nanopore long reads and Hi-C data to produce two chromosome-level genomes from mixed genomic data. The genomes of Smittia aterrima (SateA) and Smittia pratorum (SateB) were each assembled into three chromosomes, with sizes of 78.45 Mb and 71.56 Mb, scaffold N50 lengths of 25.73 Mb and 23.53 Mb, and BUSCO completeness of 98.5% and 97.8% (n = 1,367), respectively. The assemblies contained 5.68 Mb (7.24%) and 1.94 Mb (2.72%) of repetitive elements, and 12,330 (97.70% BUSCO completeness) and 11,250 (97.40%) protein-coding genes were predicted, respectively. These high-quality genomes will serve as valuable resources for understanding the evolution and environmental adaptation of chironomids.

To gain a better understanding of the fundamental mechanisms of tolerance, researchers have increasingly focused on the genomes of organisms capable of thriving in extreme environments. Three notable examples are the chironomid midges Polypedilum vanderplanki, which can survive near-complete water loss in its larval form; Belgica antarctica, which exhibits remarkable freeze tolerance; and Propsilocerus akamusi, a species recognized for its ability to tolerate pollution [5][6][7]. Such genomes have been shaped by their environments and by the adaptations these species have undergone to survive and reproduce. The Orthocladiinae, the largest subfamily within Chironomidae, is characterized by a remarkable diversity of species that have evolved varied ecological and physiological adaptations to their respective environments. Within the Orthocladiinae, the genus Smittia, which includes S. aterrima and S. pratorum, is notable for thriving in littoral environments owing to its terrestrial/amphibious way of life, whereas the majority of chironomid larvae are aquatic [8][9][10]. In addition, research on Smittia sp. has been instrumental in advancing our understanding of parthenogenesis, polytene chromosomes, embryonic development, and nucleolar RNA synthesis, demonstrating its scientific importance [11][12][13][14][15][16].
To generate genomic resources for investigating chironomid genome evolution, chromosomal composition, and tolerance specialization, we used Oxford Nanopore Technologies (ONT) and high-throughput chromosome conformation capture (Hi-C) sequencing to produce two chromosome-level reference genomes for S. aterrima and S. pratorum. We performed genome assembly and annotation and conducted evolutionary analyses. Our work generates a valuable chromosome-level genomic resource for chironomids, establishing a foundation for future research into the environmental tolerance mechanisms of these insects.

Methods
Sample collection and sequencing. Live male adult specimens of S. aterrima and S. pratorum were collected at Baitan Lake (Huanggang City, Hubei Province, China, 30.463250°N, 114.942184°E). After collection, whole specimens were immediately immersed in liquid nitrogen and stored at −80 °C. In total, 300 male adults (about 200 of S. aterrima and 100 of S. pratorum, mixed together) were used for genome sequencing.
DNA was extracted using the 1D DNA Ligation Sequencing kit SQK-LSK109. RNA was extracted with the TRIzol™ Reagent kit. After the DNA quality and quantity had been determined, a paired-end sequencing library (350 bp in length) was constructed and sequenced on the Beijing Genomics Institute (BGI) platform; library construction was completed by Berry Genomic Corporation (Beijing, China). In addition, a Single Molecule Real-Time DNA library was prepared for sequencing using the SQK-LSK109 kit with an insert size of 30 kb. The 28.11 Gb of Illumina DNA data retained after quality control was used for the genome survey. The k-mer (k = 21) analysis indicated that the genomes had low heterozygosity, ranging from 0.70% to 0.85% (Fig. 1a, Table 2), and an estimated size of about 83.52-84.51 Mb.
Genome size estimation and assembly. Quality control of the BGI data was carried out with BBTools v38.82 18: "clumpify.sh" was used to remove duplicate reads, and "bbduk.sh" was used for quality trimming and filtering, i.e. trimming bases with a quality score below 20 (Q20), discarding sequences shorter than 15 bp, removing poly-A/G/C tails longer than 10 bp, and correcting bases in the overlapping regions of read pairs. The k-mer frequencies were counted using "khist.sh" (BBTools) with the k-mer length set to 21. The k-mer profile was then analysed using GenomeScope v2.0 19, with a maximum k-mer coverage of 1,000 ("-m 1000").
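As a rough illustration of what the k-mer survey computes, the following Python sketch counts k-mers, builds the multiplicity histogram, and estimates genome size as the number of non-error k-mers divided by the depth of the main peak. It is a simplification under stated assumptions and not the model fitted by GenomeScope; the function names, the `min_depth` error cutoff, and the toy histogram are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k=21):
    """Count every k-mer as written (no reverse-complement merging)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Rough size estimate: total non-error k-mers / depth of the main peak.

    min_depth is an assumed cutoff below which k-mers are treated as
    sequencing errors; real tools fit this boundary from the data.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=lambda d: solid[d])
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical values): an error tail at depth 1
# and a single coverage peak around depth 20.
toy_histogram = {1: 500000, 19: 1000, 20: 5000, 21: 1000}
toy_size = estimate_genome_size(toy_histogram, min_depth=5)
```

With a heterozygous or repeat-rich genome the histogram has several peaks, which is why a model-based tool is preferred over this single-peak shortcut.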
For genomic contig assembly, the ONT raw data were error-corrected by NextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo), filtered for contaminated sequences using Kraken v2.1.2 20, and then assembled using NextDenovo with the parameter read_cutoff = 1k, so that sequences below 1 kb in the raw data were filtered out. One round of long-read correction using Inspector v1.2 21 and two rounds of short-read correction with NextPolish v1.3.0 22 were then performed to obtain the corrected genome sequence and further improve assembly accuracy (Table 3). Minimap2 v2.17 23 was employed as the read mapper during the long- and short-read polishing stages.
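The 1 kb read-length cutoff can be pictured with a short Python sketch. This is only an illustration of the filtering idea, not how NextDenovo applies read_cutoff internally; it assumes standard four-line FASTQ records, and the function names are hypothetical.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    lines = iter(handle)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def filter_by_length(records, min_length=1000):
    """Keep only reads whose sequence is at least min_length bases long."""
    return (rec for rec in records if len(rec[1]) >= min_length)


# Toy input (hypothetical reads): one 1,200-base read and one 4-base read.
toy_fastq = io.StringIO(
    "@read1\n" + "A" * 1200 + "\n+\n" + "I" * 1200 + "\n"
    "@read2\nACGT\n+\nIIII\n"
)
kept_headers = [h for h, _, _ in filter_by_length(read_fastq(toy_fastq))]
```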
To obtain clean Hi-C data, the adapter sequences of raw reads were trimmed and low-quality reads were removed using Juicer v1.6.2 24. Subsequently, the clean reads were mapped to the draft genome, which was scaffolded to the chromosome level using 3D-DNA. Juicebox v1.11.08 24 was used to correct possible errors (such as misjoins, translocations, and inversions) in the candidate assembly by visualizing Hi-C heatmaps. Judging from the Hi-C heatmap, information on both species (S. aterrima and S. pratorum) was obtained simultaneously (Table 4, Fig. 1b). Possible contaminants were detected using MMseqs2 v11 25, which performed BLASTN-like searches against the NCBI nucleotide (nt) and UniVec databases. The completeness of the genome was evaluated using BUSCO v3.0.2 26 with the insecta_odb10 dataset (n = 1,367 single-copy orthologues) and BUSCO v5.4.4 27 with the diptera_odb10 dataset (n = 3,285 single-copy orthologues). To calculate the mapping rate, we mapped ONT long reads and BGI short reads to the assembly using Minimap2 and then summarized the alignments using SAMtools v1.10 28 with the 'flagstat' command. Finally, the genomes of S. aterrima and S. pratorum were each assembled into three chromosomes, with sizes of 78.45 Mb and 71.56 Mb, scaffold N50 lengths of 25.73 Mb and 23.53 Mb, and GC contents of 36.93% and 41.72%, respectively; BUSCO completeness evaluated with the insecta_odb10 dataset was 98.5% and 97.8%, respectively (Table 5). The results evaluated with the diptera_odb10 dataset are given in Table 6.
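For readers unfamiliar with the assembly statistics quoted above, the following Python sketch shows how scaffold N50 and GC content are conventionally computed. The function names and the toy inputs are hypothetical and are not the actual scaffold lengths of either assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def gc_content(sequences):
    """Return GC percentage over unambiguous bases (A, C, G, T) only."""
    gc = 0
    acgt = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        acgt += sum(s.count(base) for base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


# Toy data (hypothetical): five scaffold lengths and two short sequences.
toy_n50 = n50([10, 8, 5, 3, 2])          # total 28; 10 + 8 = 18 >= 14
toy_gc = gc_content(["GGCC", "AATTNN"])  # 4 GC bases out of 8 A/C/G/T bases
```

Excluding N from the denominator is a design choice made here; some tools divide by the full sequence length instead, which gives slightly lower values for gapped scaffolds.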

Genome annotation.
Genomes are often annotated with repeat sequences, protein-coding genes, and non-coding RNA.
Two strategies were used for the annotation of gene functions. We searched the predicted genes against UniProtKB (SwissProt + TrEMBL) 46 and the non-redundant protein sequence database (NR) using Diamond v2.0.11.149 47 in its most sensitive mode with the parameters "--very-sensitive -e 1e-5". We further employed eggNOG-mapper v2.1.5 48 and InterProScan 5.53-87.0 49 to assign Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome pathway annotations and to identify protein domains. Four databases, namely protein families (Pfam) 50, SMART 51, Superfamily 52, and CDD 53, were searched by InterProScan. The results predicted by the above tools were integrated to obtain the final prediction of gene functions. For S. aterrima and S. pratorum, a high percentage of annotated genes matched the UniProtKB database, with 11,740 (95.21%) and 10,835 (96.31%) genes, respectively. InterProScan identified protein domains in 9,419/8,811 protein-coding genes, while 10,135/9,447 GO and 4,746/4,491 KEGG annotations were identified by InterProScan and eggNOG-mapper. Furthermore, 7,140/6,695 genes were annotated with GO terms, 7,673/7,202 with KEGG ko terms, 2,749/2,614 with Enzyme Codes, 4,746/4,491 with KEGG pathways, 9,419/8,811 with Reactome pathways, and 10,762/10,047 with COG functional categories (Table 7).
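The homology-based step above can be sketched as follows: given tabular search output in the standard 12-column BLAST format (which Diamond writes by default), keep the best-scoring hit per gene at an e-value of at most 1e-5 and report the fraction of genes annotated. This is a minimal illustration, not the authors' integration procedure; the function names and the identifiers in the toy table are hypothetical.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} keeping the highest
    bit score per query among hits with evalue <= max_evalue.

    Assumes the 12-column BLAST tabular layout: qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in tabular_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def percent(part, whole):
    """Percentage rounded to two decimals, as reported in the text."""
    return round(100.0 * part / whole, 2)


# Toy table (hypothetical identifiers): geneA has two hits, geneB only a weak one.
toy_table = "\n".join([
    "geneA\tprotX\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200",
    "geneA\tprotY\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-20\t150",
    "geneB\tprotZ\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-3\t40",
])
toy_best = best_hits(toy_table)
toy_rate = percent(len(toy_best), 2)
```

The same `percent` helper reproduces the UniProtKB proportions quoted above from the gene counts given in the text, for example `percent(11740, 12330)` and `percent(10835, 11250)`.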

Data Records
All raw sequencing data and the genome assemblies of S. aterrima and S. pratorum underlying this article are available at the NCBI and can be accessed under BioProject ID PRJNA809421. The Nanopore, Illumina, Hi-C, and transcriptome data can be found under identification numbers SRR23797681-SRR23797685 [54][55][56][57][58]. The assembled genomes have been deposited in the NCBI Assembly database under accession numbers GCA_033063855.1 59 and GCA_033064975.1 60. All annotation data for repeated sequences, gene structure, and functional prediction are available for download through Figshare 61 (https://doi.org/10.6084/m9.figshare.22762118).

Technical Validation
Two genome assembly methods, NextDenovo and 3D-DNA, were executed and subsequently compared (Table 3). The NextDenovo assembly was larger than predicted at 151.48 Mb and contained 78 primary contigs, with a contig N50 length of 3.39 Mb, a longest contig of 9.63 Mb, and 98.40% of BUSCO genes. The average GC content was 39.16%. After haplotig purging and genome polishing, the genome length was 151.31 Mb, with 78 primary contigs, a contig N50 length of 3.38 Mb, a longest contig of 9.62 Mb, and 99.10% of BUSCO genes. The 3D-DNA assembly showed a remarkable improvement, yielding 89 scaffolds with a scaffold N50 length of 25.73 Mb and a longest scaffold of 29.79 Mb. Additionally, the BUSCO completeness of the 3D-DNA version was 99.2%, with negligible levels of fragmented and missing genes at 0.00% and 0.80%, respectively. Using Hi-C long-range scaffolding to enhance our assembly, we anchored the assembled scaffolds into six chromosomes that can be assigned to two species: S. aterrima (SateA) and S. pratorum (SateB) (Fig. 1b). Subsequently, we imported the 3D-DNA assembly to obtain the final results of the chromosome assembly. The final genome assemblies showed a BUSCO completeness of 98.5% and 97.8% (n = 1,367), with duplicated BUSCOs of 1.3% and 0.9%, respectively. Additionally, the mapping ratios of the BGI and ONT data were high, at 93.38% and 92.35%, respectively. These indicators collectively demonstrate that the assembly has achieved a remarkable degree of continuity and integrity (Table 5).
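Mapping ratios such as those reported above are typically read from the summary that SAMtools flagstat prints. The Python sketch below parses that text and computes the percentage of QC-passed reads that mapped. The function name and the sample report are hypothetical; the counts in it are invented for illustration and are not the study's data.

```python
import re


def mapping_rate(flagstat_text):
    """Return the percentage of QC-passed reads that are mapped, parsed
    from the 'in total' and 'mapped' lines of a flagstat report."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) in total", line)
        if match:
            total = int(match.group(1))
        # Anchored at line start, so 'primary mapped' lines are not matched.
        match = re.match(r"(\d+) \+ (\d+) mapped \(", line)
        if match:
            mapped = int(match.group(1))
    if not total or mapped is None:
        raise ValueError("could not find total/mapped counts")
    return 100.0 * mapped / total


# Hypothetical excerpt of a flagstat report (illustrative counts only).
toy_report = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
934 + 0 mapped (93.40% : N/A)
"""
toy_rate = mapping_rate(toy_report)
```

Note that the total reported by flagstat counts alignment records, so secondary and supplementary alignments inflate both numerator and denominator for long-read data.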

Fig. 1 (a) The survey results obtained from the 21-mer analysis. (b) Heatmap showing genome-wide all-by-all Hi-C interactions (three chromosomes for each of S. aterrima (SateA) and S. pratorum (SateB)). The map indicates that intrachromosomal interactions (red blocks on the diagonal) were stronger than interchromosomal interactions.
MboI as the restriction enzyme. Finally, we obtained 93.95 Gb of sequencing data, comprising 28.11 Gb of Illumina reads, 21.16 Gb of Nanopore reads, 13.40 Gb of Hi-C data, and 31.28 Gb of RNA data, which consisted of 21.78 Gb of Illumina sequencing and 9.50 Gb of ONT sequencing. The mean/N50 lengths of the Nanopore and ONT reads were 6.01/21.65 kb and 0.99/1.41 kb, respectively (Table 1).

Table 1.
Raw data from the different sequencing technologies. Note: ONT, Oxford Nanopore Technologies.

Table 2.
Results of the genome survey analysis.

Table 4.
The chromosome length and sequencing coverage of S. aterrima (SateA) and S. pratorum (SateB).
and finally the software RepeatMasker v4.1.2-p1 32 with the default commands was used to predict the repeat sequences according to the custom library. The genomes of S. aterrima and S. pratorum produced a total of

Table 7.
Genome annotation and functional annotation results of S. aterrima and S. pratorum.